Abstract: In this paper we investigate the notion of conditional
independence and prove several information inequalities for
conditionally independent random variables.

Index Terms: Conditionally independent random variables,
common information, rate region.



I. Introduction 

Ahlswede, Gacs, Korner, Witsenhausen and Wyner [1], [2], [4], [7], [8]
studied the problem of extracting "common information" from a pair of
random variables. The simplest form of this problem is the following:
fix some distribution for a pair of random variables $\alpha$ and
$\beta$. Consider $n$ independent pairs
$(\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n)$; each has the same
distribution as $(\alpha, \beta)$. We want to extract "common
information" from the sequences $\alpha_1, \ldots, \alpha_n$ and
$\beta_1, \ldots, \beta_n$, i.e., to find a random variable $\gamma$
such that $H(\gamma|(\alpha_1, \ldots, \alpha_n))$ and
$H(\gamma|(\beta_1, \ldots, \beta_n))$ are small. We say that
"extraction of common information is impossible" if the entropy of any
such variable $\gamma$ is small.

Let us show that this is the case if $\alpha$ and $\beta$ are
independent. In this case $\alpha^n = (\alpha_1, \ldots, \alpha_n)$ and
$\beta^n = (\beta_1, \ldots, \beta_n)$ are independent. Recall the
well-known inequality

$$H(\gamma) \le H(\gamma|\alpha^n) + H(\gamma|\beta^n) + I(\alpha^n : \beta^n).$$

Here $I(\alpha^n : \beta^n) = 0$ (because $\alpha^n$ and $\beta^n$ are
independent); the two other summands on the right-hand side are small by
our assumption.
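The inequality recalled above can be sanity-checked numerically. The following is only an illustrative sketch, not part of the original argument: it draws random joint distributions of three finite random variables and evaluates the right-hand side minus the left-hand side. The helper names `entropy_of`, `random_joint` and `slack` are hypothetical.

```python
import itertools
import math
import random

def entropy_of(joint, axes):
    """Entropy (bits) of the marginal of `joint` on the coordinates in `axes`."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def random_joint(sizes, rng):
    """Random joint distribution {(a, b, g): probability} on a product of finite ranges."""
    outcomes = list(itertools.product(*(range(s) for s in sizes)))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}

def slack(joint):
    """H(g|a) + H(g|b) + I(a:b) - H(g) for outcomes ordered as (a, b, g)."""
    A, B, G = 0, 1, 2
    h_g_given_a = entropy_of(joint, [A, G]) - entropy_of(joint, [A])
    h_g_given_b = entropy_of(joint, [B, G]) - entropy_of(joint, [B])
    i_ab = entropy_of(joint, [A]) + entropy_of(joint, [B]) - entropy_of(joint, [A, B])
    return h_g_given_a + h_g_given_b + i_ab - entropy_of(joint, [G])

rng = random.Random(0)
worst = min(slack(random_joint((3, 3, 4), rng)) for _ in range(200))
print(worst >= -1e-9)  # the slack should never be negative, up to rounding
```

The slack equals $I(\alpha : \beta|\gamma) + H(\gamma|\alpha\beta)$, which is why it cannot be negative.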

It turns out that a similar statement holds for dependent random
variables. However, there is one exception. If the joint probability
matrix of $(\alpha, \beta)$ can be divided into blocks, there is a
random variable $\tau$ that is a function of $\alpha$ and a function of
$\beta$ ("block number"). Then $\gamma = (\tau_1, \ldots, \tau_n)$ is
common information of $\alpha^n$ and $\beta^n$.

It was shown by Ahlswede, Gacs and Korner [1], [2], [4] that this is the
only case when there exists common information.

Their original proof is quite technical. Several years ago another
approach was proposed by Romashchenko [5] using "conditionally
independent" random variables. Romashchenko introduced the notion of
conditionally independent random variables and showed that extraction of
common information from conditionally independent random variables is
impossible. We prove that if the joint probability matrix of a pair of
random variables $(\alpha, \beta)$ is not a block matrix, then $\alpha$
and $\beta$ are conditionally independent. We also show several new
information inequalities for conditionally independent random variables.

II. Conditionally independent random variables 

Consider four random variables $\alpha$, $\beta$, $\alpha^*$, $\beta^*$.
Suppose that $\alpha^*$ and $\beta^*$ are independent, $\alpha$ and
$\beta$ are independent given $\alpha^*$, and also independent given
$\beta^*$, i.e., $I(\alpha^* : \beta^*) = 0$,
$I(\alpha : \beta|\alpha^*) = 0$ and $I(\alpha : \beta|\beta^*) = 0$.
Then we say that $\alpha$ and $\beta$ are conditionally independent of
order 1. (Conditionally independent random variables of order 0 are
independent random variables.)

We consider conditional independence of random variables as a property
of their joint distributions. If a pair of random variables $\alpha$ and
$\beta$ has the same joint distribution as some pair of conditionally
independent random variables (possibly defined on another probability
space), we say that $\alpha$ and $\beta$ are conditionally independent.

Replacing the requirement of independence of $\alpha^*$ and $\beta^*$ by
the requirement of conditional independence of order 1, we get the
definition of conditionally independent random variables ($\alpha$ and
$\beta$) of order 2, and so on. (Conditionally independent variables of
order $k$ are also called $k$-conditionally independent in the sequel.)

Definition 1: We say that $\alpha$ and $\beta$ are conditionally
independent with respect to $\alpha^*$ and $\beta^*$ if $\alpha$ and
$\beta$ are independent given $\alpha^*$, and they are also independent
given $\beta^*$, i.e., $I(\alpha : \beta|\alpha^*) = I(\alpha : \beta|\beta^*) = 0$.

Definition 2: (Romashchenko [5]) Two random variables $\alpha$ and
$\beta$ are called conditionally independent random variables of order
$k$ ($k \ge 0$) if there exists a probability space $\Omega$ and a
sequence of pairs of random variables

$$(\alpha_0, \beta_0), (\alpha_1, \beta_1), \ldots, (\alpha_k, \beta_k)$$

on it such that

(a) The pair $(\alpha_0, \beta_0)$ has the same distribution as $(\alpha, \beta)$.

(b) $\alpha_i$ and $\beta_i$ are conditionally independent with respect to
$\alpha_{i+1}$ and $\beta_{i+1}$ when $0 \le i < k$.

(c) $\alpha_k$ and $\beta_k$ are independent random variables.

The sequence

$$(\alpha_0, \beta_0), (\alpha_1, \beta_1), \ldots, (\alpha_k, \beta_k)$$

is called a derivation for $(\alpha, \beta)$.

We say that random variables $\alpha$ and $\beta$ are conditionally
independent if they are conditionally independent of some order $k$.

The notion of conditional independence can be applied for 
analysis of common information using the following observa- 
tions (see below for proofs): 

Lemma 1: Consider conditionally independent random variables $\alpha$
and $\beta$ of order $k$. Let $\alpha^n$ [$\beta^n$] be a sequence of
$n$ independent random variables each with the same distribution as
$\alpha$ [$\beta$]. Then the variables $\alpha^n$ and $\beta^n$ are
conditionally independent of order $k$.

Theorem 1: (Romashchenko [5]) If random variables $\alpha$ and $\beta$
are conditionally independent of order $k$, and $\gamma$ is an arbitrary
random variable (on the same probability space), then

$$H(\gamma) \le 2^k H(\gamma|\alpha) + 2^k H(\gamma|\beta).$$

Definition 3: An $m \times n$ matrix is called a block matrix if (after
some permutation of its rows and columns) it consists of four blocks;
the blocks on the diagonal are not equal to zero; the blocks outside the
diagonal are equal to zero.

Formally, $A$ is a block matrix if the set of its first indices
$\{1, \ldots, m\}$ can be divided into two disjoint nonempty sets $I_1$
and $I_2$ ($I_1 \cup I_2 = \{1, \ldots, m\}$) and the set of its second
indices $\{1, \ldots, n\}$ can be divided into two sets $J_1$ and $J_2$
($J_1 \cup J_2 = \{1, \ldots, n\}$) in such a way that each of the
blocks $\{a_{ij} : i \in I_1, j \in J_1\}$ and
$\{a_{ij} : i \in I_2, j \in J_2\}$ contains at least one nonzero
element, and all the elements outside these two blocks are equal to 0,
i.e., $a_{ij} = 0$ when $(i, j) \in (I_1 \times J_2) \cup (I_2 \times J_1)$.
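Definition 3 can be tested mechanically. The sketch below is an illustration under one assumption about representation (the matrix is a list of rows), and the function name `is_block_matrix` is hypothetical. It links row $i$ to column $j$ for every nonzero entry and reports a block matrix exactly when the nonzero entries split into at least two connected components; all-zero rows and columns can be assigned to either side.

```python
def is_block_matrix(matrix, tol=0.0):
    """Return True if `matrix` (a list of rows) is a block matrix as in Definition 3."""
    m, n = len(matrix), len(matrix[0])
    parent = list(range(m + n))  # nodes 0..m-1 are rows, m..m+n-1 are columns

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    used = set()  # rows and columns that carry at least one nonzero entry
    for i in range(m):
        for j in range(n):
            if abs(matrix[i][j]) > tol:
                parent[find(i)] = find(m + j)
                used.update((i, m + j))
    # Two or more components give the two nonzero diagonal blocks.
    return len({find(x) for x in used}) >= 2

print(is_block_matrix([[0.5, 0.0], [0.0, 0.5]]))   # identical bits
print(is_block_matrix([[0.4, 0.3], [0.0, 0.3]]))   # one zero entry only
```

The first call should report a block matrix and the second should not, since in the second matrix every nonzero entry shares a row or a column with another one.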

Theorem 2: Random variables are conditionally independent iff their
joint probability matrix is not a block matrix.

Using these statements, we conclude that if the joint probability matrix
of a pair of random variables $(\alpha, \beta)$ is not a block matrix,
then no information can be extracted from a sequence of $n$ independent
random variables each with the same distribution as $(\alpha, \beta)$:

$$H(\gamma) \le 2^k H(\gamma|\alpha^n) + 2^k H(\gamma|\beta^n)$$

for some $k$ (that does not depend on $n$) and for any random variable
$\gamma$.

III. Proof of Theorem 1 

Theorem 1: If random variables $\alpha$ and $\beta$ are conditionally
independent of order $k$, and $\gamma$ is an arbitrary random variable
(on the same probability space), then

$$H(\gamma) \le 2^k H(\gamma|\alpha) + 2^k H(\gamma|\beta).$$

Proof: The proof is by induction on $k$. The statement is already proved
for independent random variables $\alpha$ and $\beta$ ($k = 0$).

Suppose $\alpha$ and $\beta$ are conditionally independent with respect
to conditionally independent random variables $\alpha^*$ and $\beta^*$
of order $k - 1$. From the conditional form of the inequality

$$H(\gamma) \le H(\gamma|\alpha) + H(\gamma|\beta) + I(\alpha : \beta)$$

($\alpha^*$ is added everywhere as a condition) it follows that

$$H(\gamma|\alpha^*) \le H(\gamma|\alpha\alpha^*) + H(\gamma|\beta\alpha^*) + I(\alpha : \beta|\alpha^*) = H(\gamma|\alpha\alpha^*) + H(\gamma|\beta\alpha^*) \le H(\gamma|\alpha) + H(\gamma|\beta).$$

Similarly, $H(\gamma|\beta^*) \le H(\gamma|\alpha) + H(\gamma|\beta)$.
By the induction hypothesis
$H(\gamma) \le 2^{k-1} H(\gamma|\alpha^*) + 2^{k-1} H(\gamma|\beta^*)$.
Replacing $H(\gamma|\alpha^*)$ and $H(\gamma|\beta^*)$ by their upper
bounds, we get
$H(\gamma) \le 2^k H(\gamma|\alpha) + 2^k H(\gamma|\beta)$. ■



Corollary 1.1: If the joint probability matrix A of a pair 
of random variables is a block matrix, then these random 
variables are not conditionally independent. 

Proof: Suppose that the joint probability matrix $A$ of random variables
$(\alpha, \beta)$ is a block matrix and these random variables are
conditionally independent of order $k$.

Let us divide the matrix $A$ into blocks $I_1 \times J_1$ and
$I_2 \times J_2$ as in Definition 3. Consider a random variable $\gamma$
with two values that is equal to the number of the block that contains
$(\alpha, \beta)$:

$$\gamma = 1 \Leftrightarrow \alpha \in I_1 \Leftrightarrow \beta \in J_1;$$
$$\gamma = 2 \Leftrightarrow \alpha \in I_2 \Leftrightarrow \beta \in J_2.$$

The random variable $\gamma$ is a function of $\alpha$ and at the same
time a function of $\beta$. Therefore, $H(\gamma|\alpha) = 0$ and
$H(\gamma|\beta) = 0$. However, $\gamma$ takes two different values with
positive probability. Hence $H(\gamma) > 0$, which contradicts
Theorem 1. ■

A similar argument shows that the order of conditional independence
should be large if the matrix is close to a block matrix.
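The argument of Corollary 1.1 can be illustrated on a concrete matrix. The example below is a sketch with a made-up $3 \times 3$ block matrix (the numbers and the names `block_of_row`, `block_of_col` are assumptions chosen for illustration): the block number $\gamma$ is determined by the row alone and by the column alone, yet its entropy is positive.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical block matrix: rows 0-1 occur only with columns 0-1, row 2 only with column 2.
A = [[0.2, 0.1, 0.0],
     [0.1, 0.2, 0.0],
     [0.0, 0.0, 0.4]]
block_of_row = [1, 1, 2]   # gamma as a function of alpha
block_of_col = [1, 1, 2]   # gamma as a function of beta

# gamma is well defined: both functions agree on every outcome of positive probability,
# so H(gamma|alpha) = H(gamma|beta) = 0.
consistent = all(block_of_row[i] == block_of_col[j]
                 for i in range(3) for j in range(3) if A[i][j] > 0)

p_gamma = {}
for i, row in enumerate(A):
    p_gamma[block_of_row[i]] = p_gamma.get(block_of_row[i], 0.0) + sum(row)

h_gamma = entropy(p_gamma.values())
print(consistent, h_gamma > 0)
```

Since the right-hand side of Theorem 1 would be $2^k \cdot 0 + 2^k \cdot 0 = 0$ for every $k$, a positive $H(\gamma)$ rules out conditional independence of any order.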

IV. Proof of Theorem 2 

For brevity, we call joint probability matrices of condition- 
ally independent random variables good matrices. 

The proof of Theorem 2 consists of three main steps. First, we prove
that the set of good matrices is dense in the set of all joint
probability matrices. Then we prove that any matrix without zero
elements is good. Finally, we consider the general case and prove that
any matrix that is not a block matrix is good.

The following statements are used in the sequel. 

(a) The joint probability matrix of independent random variables is a
matrix of rank 1 and vice versa. In particular, all matrices of rank 1
are good.

(b) If $\alpha$ and $\beta$ are conditionally independent, $\alpha'$ is
a function of $\alpha$ and $\beta'$ is a function of $\beta$, then
$\alpha'$ and $\beta'$ are conditionally independent. (Indeed, if
$\alpha$ and $\beta$ are conditionally independent with respect to some
$\alpha^*$ and $\beta^*$, then $\alpha'$ and $\beta'$ are also
conditionally independent with respect to $\alpha^*$ and $\beta^*$.)

(c) If two random variables are $k$-conditionally independent, then they
are $l$-conditionally independent for any $l > k$. (We can add some
constant random variables to the end of the derivation.)

(d) Assume that conditionally independent random variables $\alpha_1$
and $\beta_1$ are defined on a probability space $\Omega_1$ and
conditionally independent random variables $\alpha_2$ and $\beta_2$ are
defined on a probability space $\Omega_2$. Consider random variables
$(\alpha_1, \alpha_2)$ and $(\beta_1, \beta_2)$ that are defined in a
natural way on the Cartesian product $\Omega_1 \times \Omega_2$. Then
$(\alpha_1, \alpha_2)$ and $(\beta_1, \beta_2)$ are conditionally
independent. Indeed, for each pair $(\alpha_i, \beta_i)$ consider its
derivation

$$(\alpha_i^0, \beta_i^0), (\alpha_i^1, \beta_i^1), \ldots, (\alpha_i^l, \beta_i^l)$$

(using (c), we may assume that both derivations have the same length
$l$).

Then the sequence

$$((\alpha_1^0, \alpha_2^0), (\beta_1^0, \beta_2^0)), \ldots, ((\alpha_1^l, \alpha_2^l), (\beta_1^l, \beta_2^l))$$

is a derivation for the pair of random variables
$((\alpha_1, \alpha_2), (\beta_1, \beta_2))$. For example, random
variables $(\alpha_1, \alpha_2) = (\alpha_1^0, \alpha_2^0)$ and
$(\beta_1, \beta_2) = (\beta_1^0, \beta_2^0)$ are independent given the
value of $(\alpha_1^1, \alpha_2^1)$, because $\alpha_1$ and $\beta_1$
are independent given $\alpha_1^1$, variables $\alpha_2$ and $\beta_2$
are independent given $\alpha_2^1$, and the measure on
$\Omega_1 \times \Omega_2$ is equal to the product of the measures on
$\Omega_1$ and $\Omega_2$.

Applying (d) several times, we get Lemma 1. 

Combining Lemma 1 and (b), we get the following state- 
ment: 

(e) Let $(\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n)$ be
independent and identically distributed random variables. Assume that
the variables in each pair $(\alpha_i, \beta_i)$ are conditionally
independent. Then any random variables $\alpha'$ and $\beta'$, where
$\alpha'$ depends only on $\alpha_1, \ldots, \alpha_n$ and $\beta'$
depends only on $\beta_1, \ldots, \beta_n$, are conditionally
independent.

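Definition 4: Let us introduce the following notation:

$$D_\varepsilon = \begin{pmatrix} 1/2 - \varepsilon & \varepsilon \\ \varepsilon & 1/2 - \varepsilon \end{pmatrix}$$

(where $0 \le \varepsilon \le 1/2$).

The matrix $D_{1/4}$ corresponds to a pair of independent random bits;
as $\varepsilon$ tends to 0 these bits become more dependent (though
each is still uniformly distributed over $\{0, 1\}$).

Lemma 2: (i) $D_{1/4}$ is a good matrix.

(ii) If $D_\varepsilon$ is a good matrix then
$D_{\varepsilon(1-\varepsilon)}$ is good.

(iii) There exist arbitrarily small $\varepsilon$ such that
$D_\varepsilon$ is good.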

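Proof:

(i) The matrix $D_{1/4}$ is of rank 1, hence it is good (independent
random bits).

(ii) Consider a pair of random variables $\alpha$ and $\beta$
distributed according to $D_\varepsilon$.

Define new random variables $\alpha'$ and $\beta'$ as follows:

• if $(\alpha, \beta) = (0, 0)$ then $(\alpha', \beta') = (0, 0)$;
• if $(\alpha, \beta) = (1, 1)$ then $(\alpha', \beta') = (1, 1)$;
• if $(\alpha, \beta) = (0, 1)$ or $(\alpha, \beta) = (1, 0)$ then

$$(\alpha', \beta') = \begin{cases} (0, 0) & \text{with probability } \varepsilon/2; \\ (0, 1) & \text{with probability } (1 - \varepsilon)/2; \\ (1, 0) & \text{with probability } (1 - \varepsilon)/2; \\ (1, 1) & \text{with probability } \varepsilon/2. \end{cases}$$

The joint probability matrix of $\alpha'$ and $\beta'$ given
$\alpha = 0$ is equal to

$$\begin{pmatrix} (1-\varepsilon)^2 & \varepsilon(1-\varepsilon) \\ \varepsilon(1-\varepsilon) & \varepsilon^2 \end{pmatrix}$$

and its rank equals 1. Therefore, $\alpha'$ and $\beta'$ are independent
given $\alpha = 0$.

Similarly, the joint probability matrix of $\alpha'$ and $\beta'$ given
$\alpha = 1$, $\beta = 0$ or $\beta = 1$ has rank 1. This yields that
$\alpha'$ and $\beta'$ are conditionally independent with respect to
$\alpha$ and $\beta$, hence $\alpha'$ and $\beta'$ are conditionally
independent.

The joint distribution of $\alpha'$ and $\beta'$ is

$$\begin{pmatrix} 1/2 - \varepsilon(1-\varepsilon) & \varepsilon(1-\varepsilon) \\ \varepsilon(1-\varepsilon) & 1/2 - \varepsilon(1-\varepsilon) \end{pmatrix},$$

hence $D_{\varepsilon(1-\varepsilon)}$ is a good matrix.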
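The construction in step (ii) can be replayed with exact rational arithmetic. The sketch below is illustrative; the function names `lemma2_joint`, `matrix_given` and `det2` are hypothetical, and the starting value $\varepsilon = 1/4$ is just one convenient choice. It builds the joint distribution of $(\alpha, \beta, \alpha', \beta')$, checks that each conditional matrix of $(\alpha', \beta')$ has zero determinant, and compares the marginal of $(\alpha', \beta')$ with $D_{\varepsilon(1-\varepsilon)}$.

```python
from fractions import Fraction
from itertools import product

def lemma2_joint(eps):
    """Joint distribution {(a, b, a2, b2): p} of (alpha, beta, alpha', beta')."""
    half = Fraction(1, 2)
    joint = {}
    for a, b in product((0, 1), repeat=2):
        p_ab = half - eps if a == b else eps           # entry of D_eps
        if a == b:
            joint[(a, b, a, b)] = p_ab                 # (alpha', beta') copies (alpha, beta)
        else:
            for a2, b2 in product((0, 1), repeat=2):
                q = eps / 2 if a2 == b2 else (1 - eps) / 2
                joint[(a, b, a2, b2)] = p_ab * q
    return joint

def matrix_given(joint, index, value):
    """Unnormalized 2x2 matrix of (alpha', beta') on the event outcome[index] == value.

    With index=None no conditioning is applied and the marginal is returned.
    """
    mat = [[Fraction(0)] * 2 for _ in range(2)]
    for outcome, p in joint.items():
        if index is None or outcome[index] == value:
            mat[outcome[2]][outcome[3]] += p
    return mat

def det2(mat):
    return mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]

eps = Fraction(1, 4)
joint = lemma2_joint(eps)
new_eps = eps * (1 - eps)
marginal = matrix_given(joint, None, None)
rank_one = all(det2(matrix_given(joint, idx, val)) == 0
               for idx in (0, 1) for val in (0, 1))
print(rank_one, marginal[0][1] == new_eps)
```

A nonzero $2 \times 2$ matrix with zero determinant has rank 1, so the determinant test is enough here.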



(iii) Consider the sequence $\varepsilon_n$ defined by
$\varepsilon_0 = 1/4$ and
$\varepsilon_{n+1} = \varepsilon_n(1 - \varepsilon_n)$. The sequence
$\varepsilon_n$ tends to zero (its limit is a root of the equation
$x = x(1 - x)$). It follows from statements (i) and (ii) that all
matrices $D_{\varepsilon_n}$ are good. ■
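The recursion from step (iii) is easy to iterate. The sketch below is only an illustration (the function name `epsilon_sequence` is hypothetical). Because $1/(x(1-x)) = 1/x + 1/(1-x) \ge 1/x + 1$, the reciprocals grow by at least 1 per step, so $\varepsilon_n \le 1/(n+4)$; the code checks this bound along the way.

```python
def epsilon_sequence(steps, start=0.25):
    """Return [eps_0, ..., eps_steps] with eps_{n+1} = eps_n * (1 - eps_n)."""
    eps = start
    seq = [eps]
    for _ in range(steps):
        eps = eps * (1 - eps)
        seq.append(eps)
    return seq

seq = epsilon_sequence(200)
within_bound = all(e <= 1.0 / (n + 4) + 1e-12 for n, e in enumerate(seq))
print(seq[:3], within_bound)
```

The decay is slow (roughly like $1/n$), which matches the remark that many applications of step (ii) are needed to reach a small $\varepsilon$.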

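Note: The order of conditional independence of $D_\varepsilon$ tends to
infinity as $\varepsilon \to 0$. Indeed, applying Theorem 1 to random
variables $\alpha$ and $\beta$ with joint distribution $D_\varepsilon$
and to $\gamma = \alpha$, we obtain

$$H(\alpha) \le 2^k (H(\alpha|\alpha) + H(\alpha|\beta)) = 2^k H(\alpha|\beta).$$

Here $H(\alpha) = 1$; for any fixed value of $\beta$ the random variable
$\alpha$ takes two values with probabilities $2\varepsilon$ and
$1 - 2\varepsilon$, therefore

$$H(\alpha|\beta) = -(1 - 2\varepsilon)\log_2(1 - 2\varepsilon) - 2\varepsilon\log_2(2\varepsilon) = O(-\varepsilon\log_2\varepsilon)$$

and (if $D_\varepsilon$ corresponds to conditionally independent
variables of order $k$)

$$2^k \ge H(\alpha)/H(\alpha|\beta) = 1/O(-\varepsilon\log_2\varepsilon) \to \infty$$

as $\varepsilon \to 0$.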
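The lower bound on the order in the note above can be evaluated for specific values of $\varepsilon$. This is a sketch under the assumption $0 < \varepsilon < 1/2$; the names `binary_entropy` and `min_order` are hypothetical. It returns the smallest integer $k$ with $2^k \ge H(\alpha)/H(\alpha|\beta)$, which is only a necessary condition and not the exact order.

```python
import math

def binary_entropy(p):
    """Entropy (bits) of a bit that equals 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_order(eps):
    """Smallest integer k with 2**k >= H(alpha) / H(alpha|beta) for D_eps, 0 < eps < 1/2."""
    h_cond = binary_entropy(2 * eps)   # H(alpha|beta); H(alpha) is 1 bit
    return max(0, math.ceil(math.log2(1.0 / h_cond)))

for eps in (0.25, 3 / 16, 0.01, 0.001):
    print(eps, min_order(eps))
```

For $\varepsilon = 1/4$ the bound is 0, consistent with independent bits, and it grows as $\varepsilon$ shrinks.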

Lemma 3: The set of good matrices is dense in the set of all joint
probability matrices (i.e., the set of $m \times n$ matrices with
non-negative elements, whose sum is 1).

Proof: Any joint probability matrix $A$ can be approximated as closely
as desired by matrices with elements of the form $l/2^N$ for some $N$
(where $l$ is an integer and $N$ is the same for all matrix elements).

Therefore, it suffices to prove that any joint probability matrix $B$
with elements of the form $l/2^N$ can be approximated (as closely as
desired) by good matrices. Take a pair of random variables
$(\alpha, \beta)$ distributed according to $B$. The pair
$(\alpha, \beta)$ can be represented as a function of $N$ independent
Bernoulli trials. The joint distribution matrix of each of these trials
is $D_0$ and, by Lemma 2, can be approximated by a good matrix. Using
statement (e), we get that $(\alpha, \beta)$ can also be approximated by
a good matrix. Hence $B$ can be approximated as closely as desired by
good matrices. ■

Lemma 4: If $A = (a_{ij})$ and $B = (b_{ij})$ are stochastic matrices
and $M$ is a good matrix, then $A^T M B$ is a good matrix.

Proof: Consider a pair of random variables $(\alpha, \beta)$ distributed
according to $M$. This pair of random variables is conditionally
independent.

Roughly speaking, we define random variable $\alpha'$ [$\beta'$] as a
transition from $\alpha$ [$\beta$] with transition matrix $A$ [$B$]. The
joint probability matrix of $(\alpha', \beta')$ is equal to $A^T M B$.
But since the transitions are independent from $\alpha$ and $\beta$, the
new random variables are conditionally independent.

More formally, let us randomly (independently from $\alpha$ and $\beta$)
choose vectors $c$ and $d$ as follows:

$$\Pr(\mathrm{proj}_i(c) = j) = a_{ij},$$
$$\Pr(\mathrm{proj}_i(d) = j) = b_{ij},$$

where $\mathrm{proj}_i$ is the projection onto the $i$-th component.
Define $\alpha' = \mathrm{proj}_\alpha(c)$ and
$\beta' = \mathrm{proj}_\beta(d)$. Then

(i) the joint probability matrix of $(\alpha', \beta')$ is equal to
$A^T M B$;

(ii) the pair $(\alpha, c)$ is conditionally independent from the pair
$(\beta, d)$. Hence by statement (b), $\alpha'$ and $\beta'$ are
conditionally independent.

■
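The matrix product in Lemma 4 can be computed directly. The sketch below is illustrative only: the matrices `M`, `A` and `B` are made-up examples and the helper names are hypothetical. Entry $(j, l)$ of the result is $\sum_{i,k} a_{ij} m_{ik} b_{kl}$, the probability that $\alpha' = j$ and $\beta' = l$.

```python
from fractions import Fraction as F

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def apply_transitions(A, M, B):
    """Joint matrix A^T M B of (alpha', beta') for transition matrices A and B."""
    return matmul(matmul(transpose(A), M), B)

M = [[F(5, 16), F(3, 16)], [F(3, 16), F(5, 16)]]            # D_{3/16}
A = [[F(1, 2), F(1, 2), F(0)], [F(0), F(1, 3), F(2, 3)]]    # 2 x 3, rows sum to 1
B = [[F(3, 4), F(1, 4)], [F(1, 4), F(3, 4)]]                # 2 x 2, rows sum to 1

result = apply_transitions(A, M, B)
print(sum(map(sum, result)))   # total probability mass of the new joint matrix
```

Stochastic matrices need not be square here: `A` maps a binary $\alpha$ to a three-valued $\alpha'$, so the result is a $3 \times 2$ joint probability matrix.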

Now let us prove the following technical lemma. 

Lemma 5: For any nonsingular $n \times n$ matrix $M$ and a matrix
$R = (r_{ij})$ with the sum of its elements equal to 0, there exist
matrices $P$ and $Q$ such that

1. $R = P^T M + M Q$;

2. the sum of all elements in each row of $P$ is equal to 0;

3. the sum of all elements in each row of $Q$ is equal to 0.

Proof: First, we assume that $M = I$ (here $I$ is the identity matrix of
the proper size), and find matrices $P'$ and $Q'$ such that

$$R = P'^T + Q'.$$

Let us define $P' = (p'_{ij})$ and $Q' = (q'_{ij})$ as follows:

$$q'_{ij} = \frac{1}{n} \sum_{k=1}^{n} r_{kj}.$$

Note that all rows of $Q'$ are the same and equal to the average of rows
of $R$.

$$P' = (R - Q')^T.$$

It is easy to see that condition (1) holds. Condition (3) holds because
the sum of all elements in any row of $Q'$ is equal to the sum of all
elements of $R$ divided by $n$, which is 0 by the condition. Condition
(2) holds because

$$\sum_{j=1}^{n} p'_{ij} = \sum_{j=1}^{n} \left( r_{ji} - \frac{1}{n} \sum_{k=1}^{n} r_{ki} \right) = 0.$$

Now we consider the general case. Put $P = (M^{-1})^T P'$ and
$Q = M^{-1} Q'$. Clearly (1) holds. Conditions (2) and (3) can be
rewritten as $Pu = 0$ and $Qu = 0$, where $u$ is the vector consisting
of ones. But $Pu = (M^{-1})^T (P'u) = 0$ and $Qu = M^{-1}(Q'u) = 0$.
Hence (2) and (3) hold. ■
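The proof of Lemma 5 is constructive, so the matrices $P$ and $Q$ can be computed. The following is a minimal sketch in exact arithmetic; the function names and the example matrices `M` and `R` are assumptions made for illustration, and the small Gauss-Jordan routine is a stand-in for any matrix inverse.

```python
from fractions import Fraction

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def inverse(M):
    """Gauss-Jordan inverse over exact fractions; fails (StopIteration) if M is singular."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def lemma5(M, R):
    """Return (P, Q) with R = P^T M + M Q and zero row sums, following the proof."""
    n = len(M)
    col_avg = [sum(R[k][j] for k in range(n)) / Fraction(n) for j in range(n)]
    Q1 = [list(col_avg) for _ in range(n)]                    # Q': every row is the average row of R
    P1 = transpose([[R[i][j] - Q1[i][j] for j in range(n)]    # P' = (R - Q')^T
                    for i in range(n)])
    Minv = inverse(M)
    return matmul(transpose(Minv), P1), matmul(Minv, Q1)

M = [[2, 1], [1, 3]]        # nonsingular example
R = [[1, 2], [-4, 1]]       # entries sum to 0
P, Q = lemma5(M, R)
print([sum(row) for row in P], [sum(row) for row in Q])
```

Both printed lists should consist of zeros, which is conditions (2) and (3); condition (1) can be confirmed by recomputing $P^T M + M Q$.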

By altering the signs of P and Q we get Corollary 5.1. 

Corollary 5.1: For any nonsingular matrix $M$ and a matrix $R$ with the
sum of its elements equal to 0, there exist matrices $P$ and $Q$ such
that

1. $R = -P^T M - M Q$;

2. the sum of all elements in each row of $P$ is equal to 0;

3. the sum of all elements in each row of $Q$ is equal to 0.

Lemma 6: Any nonsingular matrix $M$ without zero elements is good.

Proof: Let $M$ be a nonsingular $n \times n$ matrix without zero
elements. By Lemma 4, it suffices to show that $M$ can be represented as

$$M = A^T G B,$$

where $G$ is a good matrix; $A$ and $B$ are stochastic matrices. In
other words, we need to find invertible stochastic matrices $A$, $B$
such that $(A^T)^{-1} M B^{-1}$ is a good matrix.

Let $V$ be the affine space of all $n \times n$ matrices in which the
sum of all the elements is equal to 1:

$$V = \{X : \sum_{i,j} x_{ij} = 1\}.$$

(This space contains the set of all joint probability matrices.)

Let $U$ be the affine space of all $n \times n$ matrices in which the
sum of all elements in each row is equal to 1:

$$U = \{X : \sum_{j} x_{ij} = 1 \text{ for all } i\}.$$

(This space contains the set of stochastic matrices.)

Let $\tilde{U}$ be a neighborhood of $I$ in $U$ such that all matrices
from this neighborhood are invertible. Define a mapping
$\psi : \tilde{U} \times \tilde{U} \to V$ as follows:

$$\psi(A, B) = (A^T)^{-1} M B^{-1}.$$

Let us show that the differential of this mapping at the point
$A = B = I$ is a surjective mapping from
$T_{(I,I)} \tilde{U} \times \tilde{U}$ (the tangent space of
$\tilde{U} \times \tilde{U}$ at the point $(I, I)$) to $T_M V$ (the
tangent space of $V$ at the point $M$). Differentiate at $(I, I)$:

$$d\psi|_{A=I,\, B=I} = d\left((A^T)^{-1} M B^{-1}\right) = -(dA)^T M - M\, dB.$$

We need to show that for any matrix $R \in T_M V$, there exist matrices
$(P, Q) \in T_{(I,I)} \tilde{U} \times \tilde{U}$ such that

$$R = -P^T M - M Q.$$

But this is guaranteed by Corollary 5.1.

Since the mapping $\psi$ has a surjective differential at $(I, I)$, it
has a surjective differential in some neighborhood $N_1$ of $(I, I)$ in
$\tilde{U} \times \tilde{U}$. Take a pair of stochastic matrices
$(A_0, B_0)$ from this neighborhood such that these matrices are
interior points of the set of stochastic matrices.

Now take a small neighborhood $N_2$ of $(A_0, B_0)$ from the
intersection of $N_1$ and the set of stochastic matrices. Since the
differential of $\psi$ at $(A_0, B_0)$ is surjective, the image of $N_2$
has an interior point. Hence it contains a good matrix (recall that the
set of good matrices is dense in the set of all joint probability
matrices). In other words,
$\psi(A_1, B_1) = (A_1^T)^{-1} M B_1^{-1}$ is a good matrix for some
pair of stochastic matrices $(A_1, B_1) \in N_2$. This finishes the
proof. ■

Lemma 7: Any joint probability matrix without zero elements is a good
matrix.

Proof: Suppose that $X = (v_1, \ldots, v_n)$ is an $m \times n$
($m > n$) matrix of rank $n$. It is equal to the product of a
nonsingular matrix and a stochastic matrix:

$$X = (v_1 - u_1 - \cdots - u_{m-n},\; v_2, \ldots, v_n,\; u_1, \ldots, u_{m-n}) \times \begin{pmatrix} 1 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \\ 1 & & & \\ \vdots & & & \\ 1 & & & \end{pmatrix}$$

(the second factor is an $m \times n$ matrix whose first $n$ rows form
the identity matrix and whose remaining $m - n$ rows have 1 in the first
column and 0 elsewhere), where $u_1, \ldots, u_{m-n}$ are sufficiently
small vectors with positive components that form a basis in
$\mathbb{R}^m$ together with $v_1, \ldots, v_n$ (it is easy to see that
such vectors do exist); vectors $u_1, \ldots, u_{m-n}$ should be small
enough to ensure that the vector $v_1 - u_1 - \cdots - u_{m-n}$ has
positive elements.

The first factor is a nonsingular matrix with positive elements and
hence is good (Lemma 6). The second factor is a stochastic matrix, so
the product is a good matrix (Lemma 4).

Therefore, any matrix of full rank without zero elements is good. If an
$m \times n$ matrix with positive elements does not have full rank, we
can add (in a similar way) $m$ linearly independent columns to get a
matrix of full rank and then represent the given matrix as a product of
a matrix of full rank and a stochastic matrix. ■

We denote by S(M) the sum of all elements of a matrix 
M. 

Lemma 8: Consider a matrix $N$ whose elements are matrices $N_{ij}$ of
the same size. If

(a) all $N_{ij}$ contain only nonnegative elements;

(b) the sum of matrices in each row and in each column of the matrix $N$
is a matrix of rank 1;

(c) the matrix $P$ with elements $p_{ij} = S(N_{ij})$ is a good joint
probability matrix;

then the sum of all the matrices $N_{ij}$ is a good matrix.

Proof: This lemma is a reformulation of the definition of conditionally
independent random variables. Consider random variables $\alpha^*$,
$\beta^*$ such that the probability of the event
$(\alpha^*, \beta^*) = (i, j)$ is equal to $p_{ij}$, and the probability
of the event

$$\alpha = k,\ \beta = l,\ \alpha^* = i,\ \beta^* = j$$

is equal to the $(k, l)$-th element of the matrix $N_{ij}$.

The sum of matrices in a row $i$ corresponds to the distribution of the
pair $(\alpha, \beta)$ given $\alpha^* = i$; the sum of matrices in a
column $j$ corresponds to the distribution of the pair $(\alpha, \beta)$
given $\beta^* = j$; the sum of all the matrices $N_{ij}$ corresponds to
the distribution of the pair $(\alpha, \beta)$. ■

From Lemma 8 it follows that any $2 \times 2$ matrix of the form
$\begin{pmatrix} a & b \\ 0 & c \end{pmatrix}$ is good.¹ Indeed, let us
apply Lemma 8 to the following matrix:

$$N = \begin{pmatrix} \begin{pmatrix} a & 0 \\ 0 & 0 \end{pmatrix} & \begin{pmatrix} 0 & b/2 \\ 0 & 0 \end{pmatrix} \\ \begin{pmatrix} 0 & b/2 \\ 0 & 0 \end{pmatrix} & \begin{pmatrix} 0 & 0 \\ 0 & c \end{pmatrix} \end{pmatrix}.$$

The sum of matrices in each row and in each column is of rank 1. The sum
of elements of each matrix $N_{ij}$ is positive, so (by Lemma 7) the
matrix $p_{ij} = S(N_{ij})$ is a good matrix. Hence the sum of matrices
$N_{ij}$ is good.

Recalling that $a$, $b$ and $c$ stand for any positive numbers whose sum
is 1, we conclude that any $2 \times 2$ matrix with 0 in the left bottom
corner and positive elements elsewhere is a good matrix. Combining this
result with the result of Lemma 7, we get that any non-block
$2 \times 2$ matrix is good.
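The hypotheses of Lemma 8 for the upper-triangular $2 \times 2$ case can be checked on concrete numbers. The sketch below is an illustration under stated assumptions: the values of $a$, $b$, $c$ are arbitrary positive fractions summing to 1, and the four blocks are one choice (with $b$ split evenly between the two off-diagonal blocks) that satisfies conditions (a) and (b). All names are hypothetical.

```python
from fractions import Fraction

def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def has_rank_one(X):
    """True for a nonzero 2x2 matrix with zero determinant."""
    nonzero = any(x != 0 for row in X for x in row)
    return nonzero and X[0][0] * X[1][1] - X[0][1] * X[1][0] == 0

def total(X):
    return sum(map(sum, X))

a, b, c = Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)
z = Fraction(0)
# N[i][j] is the block in row i, column j of the big matrix (assumed layout).
N = [[[[a, z], [z, z]], [[z, b / 2], [z, z]]],
     [[[z, b / 2], [z, z]], [[z, z], [z, c]]]]

row_sums = [add(N[i][0], N[i][1]) for i in range(2)]
col_sums = [add(N[0][j], N[1][j]) for j in range(2)]
P = [[total(N[i][j]) for j in range(2)] for i in range(2)]   # p_ij = S(N_ij)
M = add(row_sums[0], row_sums[1])                            # sum of all blocks

print(all(map(has_rank_one, row_sums + col_sums)), M)
```

Here `P` has only positive entries, so Lemma 7 applies to it, and `M` is the target matrix with $a$, $b$ in the first row and $0$, $c$ in the second.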

In the general case (we have to prove that any non-block 
matrix is good) the proof is more complicated. 

We will use the following definitions: 

Definition 5: The support of a matrix is the set of positions of its
nonzero elements. An r-matrix is a matrix with nonnegative elements and
with a "rectangular" support (i.e., with support $A \times B$ where $A$
[$B$] is some set of rows [columns]).

¹ $a$, $b$ and $c$ are positive numbers whose sum equals 1.



Lemma 9: Any r-matrix $M$ is the sum of some r-matrices of rank 1 with
the same support as $M$.

Proof: Denote the support of $M$ by $N = A \times B$. Consider the basis
$E_{ij}$ in the vector space of matrices whose support is a subset of
$N$. (Here $E_{ij}$ is the matrix that has 1 in the $(i, j)$-position
and 0 elsewhere.) The matrix $M$ has positive coordinates in the basis
$E_{ij}$. Let us approximate each matrix $E_{ij}$ by a slightly
different matrix $E'_{ij}$ of rank 1 with support $N$:



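where $e_1, \ldots, e_n$ is the standard basis in $\mathbb{R}^n$.

The coordinates of $M$ in the new basis $E'_{ij}$ continuously depend on
$\varepsilon$. Thus they remain positive if $\varepsilon$ is
sufficiently small. So taking a sufficiently small $\varepsilon$ we get
the required representation of $M$ as the sum of matrices of rank 1 with
support $N$:

$$M = \sum_{(i,j) \in N} c_{ij} E'_{ij}$$

with positive coefficients $c_{ij}$ (the coordinates of $M$ in the new
basis).
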

Definition 6: An r-decomposition of a matrix is its expression as a
(finite) sum of r-matrices $M = M_1 + M_2 + \ldots$ of the same size
such that the supports of $M_i$ and $M_{i+1}$ intersect (for any $i$).
The length of the decomposition is the number of the summands; the
r-complexity of a matrix is the length of its shortest decomposition (or
$+\infty$, if there is no such decomposition).

Lemma 10: Any non-block matrix M with nonnegative 
elements has an r-decomposition. 

Proof: Consider a graph whose vertices are nonzero entries of $M$. Two
vertices are connected by an edge iff they are in the same row or
column. By assumption, the matrix is a non-block matrix, hence the graph
is connected and there exists a (possibly non-simple) path
$(i_1, j_1), \ldots, (i_m, j_m)$ that visits each vertex of the graph at
least once.

Express $M$ as the sum of matrices corresponding to the edges of the
path: each edge corresponds to a matrix whose support consists of the
endpoints of the edge; each positive element of $M$ is distributed among
matrices corresponding to the adjacent edges. Each of these matrices is
of rank 1. So the expression of $M$ as the sum of these matrices is an
r-decomposition.

■ 

Corollary 10.1: The r-complexity of any non-block matrix 
is finite. 

Lemma 11: Any non-block matrix M is good. 

Proof: The proof uses induction on r-complexity of M. 
For matrices of r-complexity 1, we apply Lemma 7. 

Now suppose that M has r-complexity 2. In this case M is 
equal to the sum of some r-matrices A and B such that their 
supports are intersecting rectangles. By Lemma 9, each of the 
matrices A and B is the sum of matrices of rank 1 with the 
same support. 
Suppose, for example, that $A = A_1 + A_2 + A_3$ and
$B = B_1 + B_2$. Consider the block matrix

$$\begin{pmatrix} A_1 & 0 & 0 & 0 & 0 \\ 0 & A_2 & 0 & 0 & 0 \\ 0 & 0 & A_3 & 0 & 0 \\ 0 & 0 & 0 & B_1 & 0 \\ 0 & 0 & 0 & 0 & B_2 \end{pmatrix}.$$




The sum of the matrices in each row and in each column is a matrix of
rank 1. The sum of all the entries is equal to $A + B$. All the
conditions of Lemma 8 but one hold. The only problem is that the matrix
$(p_{ij})$ is diagonal and hence is not good, where $p_{ij}$ is the sum
of the elements of the matrix in the $(i, j)$-th entry (see Lemma 8). To
overcome this obstacle take a matrix $e$ with only one nonzero element
that is located in the intersection of the supports of $A$ and $B$. If
this nonzero element is sufficiently small, then all the elements of the
matrix

$$N = \begin{pmatrix} A_1 - 4e & e & e & e & e \\ e & A_2 - 4e & e & e & e \\ e & e & A_3 - 4e & e & e \\ e & e & e & B_1 - 4e & e \\ e & e & e & e & B_2 - 4e \end{pmatrix}$$

are nonnegative matrices. The sum of the elements of each of the
matrices that form the matrix $N$ is positive. And the sum of the
elements in any row and in any column is not changed, so it is of
rank 1. Using Lemma 8 we conclude that the matrix $M$ is good.

The proof for matrices of r-complexity 3 is similar. For simplicity,
consider the case where a matrix of complexity 3 has an r-decomposition
$M = A + B + C$, where $A$, $B$, $C$ are r-matrices of rank 1. Let $e_1$
be a matrix with one positive element that belongs to the intersection
of the supports of $A$ and $B$ (all other matrix elements are zeros),
and $e_2$ be a matrix with a positive element in the intersection of the
supports of $B$ and $C$.

Now consider the block matrix

$$N = \begin{pmatrix} A - e_1 & e_1 & 0 \\ e_1 & B - e_1 - e_2 & e_2 \\ 0 & e_2 & C - e_2 \end{pmatrix}.$$

Clearly, the sums of the matrices in each row and in each column are of
rank 1. The support of the matrix $(p_{ij})$ is of the form

$$\begin{pmatrix} * & * & 0 \\ * & * & * \\ 0 & * & * \end{pmatrix}$$

and $(p_{ij})$ has r-complexity 2.² By the inductive assumption any
matrix of r-complexity 2 is good. Therefore, $M$ is a good matrix
(Lemma 8).

In the general case (any matrix of r-complexity 3) the reasoning is
similar. Each of the matrices $A$, $B$, $C$ is represented as the sum of
some matrices of rank 1 (by Lemma 9). Then we need several entries $e_1$
($e_2$) (as it was for matrices of r-complexity 2). In the same way, we
prove the lemma for matrices of r-complexity 4, etc. ■

² Its support is the union of two intersecting rectangles, so the matrix
is the sum of two r-matrices.



This concludes the proof of Theorem 2: Random variables are
conditionally independent if and only if their joint probability matrix
is a non-block matrix.

Note that this proof is "constructive" in the following sense. Assume
that the joint probability matrix for $\alpha$, $\beta$ is given and
this matrix is not a block matrix. (For simplicity we assume that matrix
elements are rational numbers, though this is not an important
restriction.) Then we can effectively find $k$ such that $\alpha$ and
$\beta$ are $k$-conditionally independent, and find the joint
distribution of all random variables that appear in the definition of
$k$-conditional independence. (Probabilities for that distribution are
not necessarily rational numbers, but we can provide algorithms that
compute approximations with arbitrary precision.)

V. Improved version of Theorem 1

The inequality

$$H(\gamma) \le 2^k H(\gamma|\alpha) + 2^k H(\gamma|\beta)$$

from Theorem 1 can be improved. In this section we prove a stronger
theorem.

Theorem 3: If random variables $\alpha$ and $\beta$ are conditionally
independent of order $k$, and $\gamma$ is an arbitrary random variable,
then

$$H(\gamma) \le 2^k H(\gamma|\alpha) + 2^k H(\gamma|\beta) - (2^{k+1} - 1) H(\gamma|\alpha\beta),$$

or, in another form,

$$I(\gamma : \alpha\beta) \le 2^k I(\gamma : \alpha|\beta) + 2^k I(\gamma : \beta|\alpha).$$

Proof: The proof is by induction on $k$.

We use the following inequality:

$$H(\gamma) = H(\gamma|\alpha) + H(\gamma|\beta) + I(\alpha : \beta) - I(\alpha : \beta|\gamma) - H(\gamma|\alpha\beta) \le H(\gamma|\alpha) + H(\gamma|\beta) + I(\alpha : \beta) - H(\gamma|\alpha\beta).$$

If $\alpha$ and $\beta$ are independent then $I(\alpha : \beta) = 0$,
and we get the required inequality.

Assume that $\alpha$ and $\beta$ are conditionally independent with
respect to $\alpha'$ and $\beta'$; $\alpha'$ and $\beta'$ are
conditionally independent of order $k - 1$.

We can assume without loss of generality that the two random variables,
the pair $(\alpha', \beta')$ and $\gamma$, are independent given
$(\alpha, \beta)$. Indeed, consider random variables
$(\alpha^*, \beta^*)$ defined by the following formula:

$$\Pr(\alpha^* = c, \beta^* = d \mid \alpha = a, \beta = b, \gamma = g) = \Pr(\alpha' = c, \beta' = d \mid \alpha = a, \beta = b).$$

The distribution of $(\alpha, \beta, \alpha^*, \beta^*)$ is the same as
the distribution of $(\alpha, \beta, \alpha', \beta')$, and
$(\alpha^*, \beta^*)$ is independent from $\gamma$ given
$(\alpha, \beta)$.

From the "relativized" form of the inequality

$$H(\gamma) \le H(\gamma|\alpha) + H(\gamma|\beta) + I(\alpha : \beta) - H(\gamma|\alpha\beta)$$

($\alpha'$ is added as a condition everywhere) it follows that

$$H(\gamma|\alpha') \le H(\gamma|\alpha\alpha') + H(\gamma|\beta\alpha') + I(\alpha : \beta|\alpha') - H(\gamma|\alpha'\alpha\beta) \le H(\gamma|\alpha) + H(\gamma|\beta) - H(\gamma|\alpha'\alpha\beta).$$

Note that according to our assumption $\alpha'$ and $\gamma$ are
independent given $\alpha$ and $\beta$, so
$H(\gamma|\alpha'\alpha\beta) = H(\gamma|\alpha\beta)$.

Using the upper bound for $H(\gamma|\alpha')$, the similar bound for
$H(\gamma|\beta')$ and the induction assumption, we conclude that

$$H(\gamma) \le 2^k H(\gamma|\alpha) + 2^k H(\gamma|\beta) - 2^k H(\gamma|\alpha\beta) - (2^k - 1) H(\gamma|\alpha'\beta').$$

Applying the inequality

$$H(\gamma|\alpha'\beta') \ge H(\gamma|\alpha'\beta'\alpha\beta) = H(\gamma|\alpha\beta),$$

we get the statement of the theorem. ■
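Theorem 3 can be sanity-checked numerically on a pair that is known to be conditionally independent of order 1, namely a pair distributed according to $D_{3/16}$ (obtained from $D_{1/4}$ by one application of Lemma 2 (ii)). The sketch below is illustrative and not part of the proof: it attaches a random variable $\gamma$ with a random conditional distribution given $(\alpha, \beta)$ and evaluates the right-hand side minus the left-hand side for $k = 1$. All function names are hypothetical.

```python
import math
import random

def entropy_of(joint, axes):
    """Entropy (bits) of the marginal of `joint` on the coordinates in `axes`."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def with_random_gamma(pair_matrix, gamma_size, rng):
    """Joint {(a, b, g): p} where gamma has a random conditional law given (a, b)."""
    joint = {}
    for a, row in enumerate(pair_matrix):
        for b, p_ab in enumerate(row):
            weights = [rng.random() for _ in range(gamma_size)]
            total = sum(weights)
            for g, w in enumerate(weights):
                joint[(a, b, g)] = p_ab * w / total
    return joint

def theorem3_slack(joint, k):
    """2^k H(g|a) + 2^k H(g|b) - (2^(k+1) - 1) H(g|ab) - H(g)."""
    A, B, G = 0, 1, 2
    h_g_a = entropy_of(joint, [A, G]) - entropy_of(joint, [A])
    h_g_b = entropy_of(joint, [B, G]) - entropy_of(joint, [B])
    h_g_ab = entropy_of(joint, [A, B, G]) - entropy_of(joint, [A, B])
    return 2**k * (h_g_a + h_g_b) - (2**(k + 1) - 1) * h_g_ab - entropy_of(joint, [G])

D = [[5 / 16, 3 / 16], [3 / 16, 5 / 16]]   # D_{3/16}, order 1 by Lemma 2
rng = random.Random(1)
worst = min(theorem3_slack(with_random_gamma(D, 3, rng), 1) for _ in range(300))
print(worst >= -1e-9)
```

A random search of this kind cannot prove the inequality; it only fails to find a counterexample, which is what the theorem predicts.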

VI. Rate Regions 

Definition 7: The rate region of a pair of random variables $\alpha$,
$\beta$ is the set of triples of real numbers $(u, v, w)$ such that for
all $\varepsilon > 0$, $\delta > 0$ and sufficiently large $n$ there
exist

• "coding" functions $t$, $f$ and $g$; their arguments are pairs
$(\alpha^n, \beta^n)$; their values are binary strings of length
$\lfloor (u + \delta)n \rfloor$, $\lfloor (v + \delta)n \rfloor$ and
$\lfloor (w + \delta)n \rfloor$ (respectively).
• "decoding" functions $r$ and $s$ such that

$$r(t(\alpha^n, \beta^n), f(\alpha^n, \beta^n)) = \alpha^n$$

and

$$s(t(\alpha^n, \beta^n), g(\alpha^n, \beta^n)) = \beta^n$$

with probability more than $1 - \varepsilon$.

This definition (standard for multisource coding theory, see [3])
corresponds to the scheme of information transmission presented in
Figure 1.

The following theorem was discovered by Vereshchagin. It gives a new
constraint on the rate region when $\alpha$ and $\beta$ are
conditionally independent.

Theorem 4: Let $\alpha$ and $\beta$ be $k$-conditionally independent
random variables. Then,

$$H(\alpha) + H(\beta) \le v + w + (2 - 2^{-k})u$$

for any triple $(u, v, w)$ in the rate region.

(It is easy to see that $H(\alpha) \le u + v$ since $\alpha^n$ can be
reconstructed with high probability from strings of length approximately
$nu$ and $nv$. For similar reasons we have $H(\beta) \le u + w$.
Therefore,

$$H(\alpha) + H(\beta) \le v + w + 2u$$

for any $\alpha$ and $\beta$. Theorem 4 gives a stronger bound for the
case when $\alpha$ and $\beta$ are $k$-conditionally independent.)

Proof: Consider random variables

$$\gamma = t(\alpha^n, \beta^n), \quad \xi = f(\alpha^n, \beta^n), \quad \eta = g(\alpha^n, \beta^n)$$

from the definition of the rate region (for some fixed
$\varepsilon > 0$). By Lemma 1 and Theorem 1, we have

$$H(\gamma) \le 2^k \left( H(\gamma|\alpha^n) + H(\gamma|\beta^n) \right).$$

Fig. 1. Values of $\alpha^n$ and $\beta^n$ are encoded by functions $f$,
$t$ and $g$ and then transmitted via channels of limited capacity
(dashed lines); decoder functions $r$ and $s$ have to reconstruct values
$\alpha^n$ and $\beta^n$ with high probability having access only to a
part of transmitted information.

We can rewrite this inequality as

$$2^{-k} H(\gamma) \le H((\gamma, \alpha^n)) + H((\gamma, \beta^n)) - H(\alpha^n) - H(\beta^n)$$

or

$$H(\xi) + H(\eta) + (2 - 2^{-k}) H(\gamma) \ge H(\xi) + H(\eta) + 2H(\gamma) - H((\gamma, \alpha^n)) - H((\gamma, \beta^n)) + H(\alpha^n) + H(\beta^n).$$

We will prove the following inequality

$$H(\xi) + H(\gamma) - H((\gamma, \alpha^n)) \ge -c\varepsilon n$$

for some constant $c$ that does not depend on $\varepsilon$ and for
sufficiently large $n$. Using this inequality and the symmetric
inequality

$$H(\eta) + H(\gamma) - H((\gamma, \beta^n)) \ge -c\varepsilon n$$

we conclude that

$$H(\xi) + H(\eta) + (2 - 2^{-k}) H(\gamma) \ge H(\alpha^n) + H(\beta^n) - 2c\varepsilon n.$$

Recall that values of $\xi$ are $(v + \delta)n$-bit strings; therefore
$H(\xi) \le (v + \delta)n$. Using similar arguments for $\eta$ and
$\gamma$ and recalling that $H(\alpha^n) = nH(\alpha)$ and
$H(\beta^n) = nH(\beta)$ (independence) we conclude that

$$(v + \delta)n + (w + \delta)n + (2 - 2^{-k})(u + \delta)n \ge nH(\alpha) + nH(\beta) - 2c\varepsilon n.$$

Dividing by $n$ and recalling that $\varepsilon$ and $\delta$ may be
chosen arbitrarily small (according to the definition of the rate
region), we get the statement of Theorem 4.

It remains to prove that

$$H(\xi) + H(\gamma) - H((\gamma, \alpha^n)) \ge -c\varepsilon n$$

for some $c$ that does not depend on $\varepsilon$ and for sufficiently
large $n$. For that we need the following simple bound:

Lemma 12: Let $\mu$ and $\mu'$ be two random variables that coincide
with probability $(1 - \varepsilon)$ where $\varepsilon < 1/2$. Then

$$H(\mu') \le H(\mu) + 1 + \varepsilon \log m$$

where $m$ is the number of possible values of $\mu'$.

Proof: Consider a new random variable $\sigma$ with $m + 1$ values that
is equal to $\mu'$ if $\mu \ne \mu'$ and takes a special value if
$\mu = \mu'$. We can use at most $1 + \varepsilon \log m$ bits on
average to encode $\sigma$ ($\log m$ bits with probability
$\varepsilon$, if $\mu \ne \mu'$, and one additional bit to distinguish
between the cases $\mu = \mu'$ and $\mu \ne \mu'$). Therefore,
$H(\sigma) \le 1 + \varepsilon \log m$. If we know the values of $\mu$
and $\sigma$, we can determine the value of $\mu'$, therefore

$$H(\mu') \le H(\mu) + H(\sigma) \le H(\mu) + 1 + \varepsilon \log m.$$

■

The statement of Lemma 12 remains true if $\mu'$ can be reconstructed
from $\mu$ with probability at least $(1 - \varepsilon)$ (just replace
$\mu$ with a function of $\mu$).
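The bound of Lemma 12 can be checked numerically on randomly generated pairs. The sketch below is an illustration with hypothetical helper names: it builds a joint distribution of $(\mu, \mu')$ on $m$ values in which the two variables differ with probability exactly $\varepsilon$, then evaluates $H(\mu) + 1 + \varepsilon \log_2 m - H(\mu')$.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_pair(m, eps, rng):
    """Joint {(x, y): p} on m >= 2 values with Pr(x != y) = eps and random weights."""
    diag = [rng.random() for _ in range(m)]
    off = {(x, y): rng.random() for x in range(m) for y in range(m) if x != y}
    d_total, o_total = sum(diag), sum(off.values())
    joint = {(x, x): (1 - eps) * w / d_total for x, w in enumerate(diag)}
    joint.update({xy: eps * w / o_total for xy, w in off.items()})
    return joint

def lemma12_slack(joint, m):
    """H(mu) + 1 + eps * log2(m) - H(mu') for joint[(x, y)] = Pr(mu = x, mu' = y)."""
    eps = sum(p for (x, y), p in joint.items() if x != y)
    p_mu, p_mu2 = {}, {}
    for (x, y), p in joint.items():
        p_mu[x] = p_mu.get(x, 0.0) + p
        p_mu2[y] = p_mu2.get(y, 0.0) + p
    return entropy(p_mu.values()) + 1 + eps * math.log2(m) - entropy(p_mu2.values())

rng = random.Random(2)
worst = min(lemma12_slack(random_pair(5, 0.1, rng), 5) for _ in range(300))
print(worst >= 0)
```

In such experiments the slack tends to stay well above zero, which reflects that the constant 1 in the lemma is a crude bound on the cost of the one-bit flag.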

Now recall that the pair $(\gamma, \alpha^n)$ can be reconstructed from
$\xi$ and $\gamma$ (using the decoding function $r$) with probability
$(1 - \varepsilon)$. Therefore, $H((\gamma, \alpha^n))$ does not exceed
$H((\xi, \gamma)) + 1 + c\varepsilon n$ (for some $c$ and large enough
$n$) because both $\gamma$ and $\alpha^n$ have range of cardinality
$O(1)^n$. It remains to note that
$H((\xi, \gamma)) \le H(\xi) + H(\gamma)$. ■